Add vectorization in elementwise_util #9432

swolchok · 2025-03-20T00:36:21Z

This is a first cut at #9241 . In this PR I've vectorized op_mul to make sure that vectorization doesn't break tests; a follow-up PR will make all existing portable ops vectorized-capable. I've left covering ops that use the unary_ufunc_* utilities in pattern.h for a follow-up push, because pattern.h and elementwise_util need some work before we can migrate pattern.h's utilities to be backed by elementwise_util.

[ghstack-poisoned]

swolchok · 2025-03-20T00:36:21Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2025-03-20T00:36:23Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9432

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 70 New Failures

As of commit 0beabbb with merge base 1572381 ():

NEW FAILURES - The following jobs have failed:

Build documentation / build (buck2) / Build doc (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
Lint / lintrunner / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / android / build-llm-demo / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_addmm.cpp:96:26: error: invalid operands to binary expression ('const at::vec::Vectorized<c10::BFloat16>' and 'const CTYPE' (aka 'const c10::BFloat16'))
pull / test-custom-ops-linux / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-eval_llama-mmlu-linux / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_sub.cpp:120:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-eval_llama-wikitext-linux / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_sub.cpp:120:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-llama_runner_eager-linux / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_mul.cpp:102:50: error: invalid operands to binary expression ('const at::vec::Vectorized<bool>' and 'const CTYPE_COMPUTE' (aka 'const bool'))
pull / test-llama-runner-linux (bf16, custom, linux.2xlarge, executorch-ubuntu-22.04-clang12) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_addmm.cpp:96:26: error: invalid operands to binary expression ('const at::vec::Vectorized<c10::BFloat16>' and 'const CTYPE' (aka 'const c10::BFloat16'))
pull / test-llama-runner-linux (fp32, xnnpack+custom+qe, linux.2xlarge, executorch-ubuntu-22.04-clang12) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_addmm.cpp:96:26: error: invalid operands to binary expression ('const at::vec::Vectorized<c10::BFloat16>' and 'const CTYPE' (aka 'const c10::BFloat16'))
pull / test-llama-runner-linux (fp32, xnnpack+custom+qe, linux.arm64.2xlarge, executorch-ubuntu-22.04-gc... / linux-job (gh)
/usr/include/c++/11/bits/std_function.h:435:9: error: ‘std::function<_Res(_ArgTypes ...)>::function(_Functor&&) [with _Functor = torch::executor::native::utils::internal::dtype_specialized_elementwise_fn_impl<c10::BFloat16, c10::BFloat16, torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes> >(const torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>&, executorch::runtime::KernelRuntimeContext&, const executorch::runtime::etensor::Tensor&, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>)::<lambda(auto:26, auto:27)>; _Constraints = void; _Res = void; _ArgTypes = {long int, long int}]’, declared using local type ‘torch::executor::native::utils::internal::dtype_specialized_elementwise_fn_impl<c10::BFloat16, c10::BFloat16, torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes> >(const torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>&, executorch::runtime::KernelRuntimeContext&, const executorch::runtime::etensor::Tensor&, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>)::<lambda(auto:26, auto:27)>’, is used but never defined [-fpermissive]
pull / test-llama-runner-linux (fp32, xnnpack+custom+quantize_kv, linux.2xlarge, executorch-ubuntu-22.04... / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-llama-runner-linux (fp32, xnnpack+custom+quantize_kv, linux.arm64.2xlarge, executorch-ubuntu... / linux-job (gh)
/usr/include/c++/11/bits/std_function.h:435:9: error: ‘std::function<_Res(_ArgTypes ...)>::function(_Functor&&) [with _Functor = torch::executor::native::utils::internal::dtype_specialized_elementwise_fn_impl<c10::BFloat16, c10::BFloat16, torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes> >(const torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>&, executorch::runtime::KernelRuntimeContext&, const executorch::runtime::[ 51%] Building C object backends/xnnpack/third-party/XNNPACK/CMakeFiles/microkernels-prod.dir/src/f32-vcopysign/gen/f32-vcopysign-scalar.c.o
pull / test-llama-runner-linux (fp32, xnnpack+quantize_kv, linux.2xlarge, executorch-ubuntu-22.04-clang12) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-llama-runner-linux (fp32, xnnpack+quantize_kv, linux.arm64.2xlarge, executorch-ubuntu-22.04-... / linux-job (gh)
/usr/include/c++/11/bits/std_function.h:435:9: error: ‘std::function<_Res(_ArgTypes ...)>::function(_Functor&&) [with _Functor = torch::executor::native::utils::internal::dtype_specialized_elementwise_fn_impl<c10::BFloat16, c10::BFloat16, torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes> >(const torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>&, executorch::runtime::KernelRuntimeContext&, const executorch::runtime::etensor::Tensor&, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>)::<lambda(auto:26, auto:27)>; _Constraints = void; _Res = void; _ArgTypes = {long int, long int}]’, declared using local type ‘torch::executor::native::utils::internal::dtype_specialized_elementwise_fn_impl<c10::BFloat16, c10::BFloat16, torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes> >(const torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>&, executorch::runtime::KernelRuntimeContext&, const executorch::runtime::etensor::Tensor&, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>)::<lambda(auto:26, auto:27)>’, is used but never defined [-fpermissive]
pull / test-llama-runner-linux-android / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-llama-runner-qnn-linux (fp32, qnn_8a8w, qnn) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_addmm.cpp:96:26: error: invalid operands to binary expression ('const at::vec::Vectorized<c10::BFloat16>' and 'const CTYPE' (aka 'const c10::BFloat16'))
pull / test-llava-runner-linux / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_sub.cpp:120:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux (add_mul, portable, linux.2xlarge) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux (add_mul, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux (add, portable, linux.2xlarge) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux (add, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux (emformer_join, portable, linux.4xlarge.memory) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_div.cpp:198:50: error: invalid operands to binary expression ('const at::vec::Vectorized<float>' and 'const CTYPE_COMPUTE' (aka 'const float'))
pull / test-models-linux (emformer_join, xnnpack-quantization-delegation, linux.4xlarge.memory) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_div.cpp:198:50: error: invalid operands to binary expression ('const at::vec::Vectorized<float>' and 'const CTYPE_COMPUTE' (aka 'const float'))
pull / test-models-linux (emformer_transcribe, portable, linux.2xlarge) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_addmm.cpp:96:26: error: invalid operands to binary expression ('const at::vec::Vectorized<c10::BFloat16>' and 'const CTYPE' (aka 'const c10::BFloat16'))
pull / test-models-linux (emformer_transcribe, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux (ic3, portable, linux.2xlarge) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux (ic3, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux (ic4, portable, linux.4xlarge.memory) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_div.cpp:198:50: error: invalid operands to binary expression ('const at::vec::Vectorized<float>' and 'const CTYPE_COMPUTE' (aka 'const float'))
pull / test-models-linux (ic4, xnnpack-quantization-delegation, linux.4xlarge.memory) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_div.cpp:198:50: error: invalid operands to binary expression ('const at::vec::Vectorized<float>' and 'const CTYPE_COMPUTE' (aka 'const float'))
pull / test-models-linux (linear, portable, linux.2xlarge) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_addmm.cpp:96:26: error: invalid operands to binary expression ('const at::vec::Vectorized<c10::BFloat16>' and 'const CTYPE' (aka 'const c10::BFloat16'))
pull / test-models-linux (linear, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux (llama3_2_vision_encoder, portable, linux.4xlarge.memory) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_div.cpp:198:50: error: invalid operands to binary expression ('const at::vec::Vectorized<float>' and 'const CTYPE_COMPUTE' (aka 'const float'))
pull / test-models-linux (mobilebert, portable, linux.2xlarge) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux (mobilebert, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux (mv2, portable, linux.2xlarge) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_addmm.cpp:96:26: error: invalid operands to binary expression ('const at::vec::Vectorized<c10::BFloat16>' and 'const CTYPE' (aka 'const c10::BFloat16'))
pull / test-models-linux (mv2, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux (phi_4_mini, portable, linux.4xlarge.memory) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_div.cpp:198:50: error: invalid operands to binary expression ('const at::vec::Vectorized<float>' and 'const CTYPE_COMPUTE' (aka 'const float'))
pull / test-models-linux (resnet18, portable, linux.2xlarge) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux (resnet18, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_addmm.cpp:96:26: error: invalid operands to binary expression ('const at::vec::Vectorized<c10::BFloat16>' and 'const CTYPE' (aka 'const c10::BFloat16'))
pull / test-models-linux (resnet50, portable, linux.2xlarge) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux (resnet50, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_addmm.cpp:96:26: error: invalid operands to binary expression ('const at::vec::Vectorized<c10::BFloat16>' and 'const CTYPE' (aka 'const c10::BFloat16'))
pull / test-models-linux (w2l, portable, linux.4xlarge.memory) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_div.cpp:198:50: error: invalid operands to binary expression ('const at::vec::Vectorized<float>' and 'const CTYPE_COMPUTE' (aka 'const float'))
pull / test-models-linux-basic (mv3, portable, buck2, linux.2xlarge, executorch-ubuntu-22.04-clang12) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux-basic (mv3, portable, cmake, linux.2xlarge, executorch-ubuntu-22.04-clang12) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_addmm.cpp:96:26: error: invalid operands to binary expression ('const at::vec::Vectorized<c10::BFloat16>' and 'const CTYPE' (aka 'const c10::BFloat16'))
pull / test-models-linux-basic (mv3, portable, cmake, linux.arm64.2xlarge, executorch-ubuntu-22.04-gcc11... / linux-job (gh)
/usr/include/c++/11/bits/std_function.h:435:9: error: ‘std::function<_Res(_ArgTypes ...)>::function(_Functor&&) [with _Functor = torch::executor::native::utils::internal::dtype_specialized_elementwise_fn_impl<c10::BFloat16, c10::BFloat16, torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes> >(const torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>&, executorch::runtime::KernelRuntimeContext&, const executorch::runtime::etensor::Tensor&, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>)::<lambda(auto:26, auto:27)>; _Constraints = void; _Res = void; _ArgTypes = {long int, long int}]’, declared using local type ‘torch::executor::native::utils::internal::dtype_specialized_elementwise_fn_impl<c10::BFloat16, c10::BFloat16, torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes> >(const torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>&, executorch::runtime::KernelRuntimeContext&, const executorch::runtime::etensor::Tensor&, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>)::<lambda(auto:26, auto:27)>’, is used but never defined [-fpermissive]
pull / test-models-linux-basic (mv3, xnnpack-quantization-delegation, buck2, linux.2xlarge, executorch-u... / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux-basic (mv3, xnnpack-quantization-delegation, cmake, linux.2xlarge, executorch-u... / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux-basic (mv3, xnnpack-quantization-delegation, cmake, linux.arm64.2xlarge, execut... / linux-job (gh)
/usr/include/c++/11/bits/std_function.h:435:9: error: ‘std::function<_Res(_ArgTypes ...)>::function(_Functor&&) [with _Functor = torch::executor::native::utils::internal::dtype_specialized_elementwise_fn_impl<c10::BFloat16, c10::BFloat16, torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes> >(const torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>&, executorch::runtime::KernelRuntimeContext&, const executorch::runtime::etensor::Tensor&, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>)::<lambda(auto:26, auto:27)>; _Constraints = void; _Res = void; _ArgTypes = {long int, long int}]’, declared using local type ‘torch::executor::native::utils::internal::dtype_specialized_elementwise_fn_impl<c10::BFloat16, c10::BFloat16, torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes> >(const torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>&, executorch::runtime::KernelRuntimeContext&, const executorch::runtime::etensor::Tensor&, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>)::<lambda(auto:26, auto:27)>’, is used but never defined [-fpermissive]
pull / test-models-linux-basic (vit, portable, buck2, linux.2xlarge, executorch-ubuntu-22.04-clang12) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_addmm.cpp:96:26: error: invalid operands to binary expression ('const at::vec::Vectorized<c10::BFloat16>' and 'const CTYPE' (aka 'const c10::BFloat16'))
pull / test-models-linux-basic (vit, portable, cmake, linux.2xlarge, executorch-ubuntu-22.04-clang12) / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux-basic (vit, portable, cmake, linux.arm64.2xlarge, executorch-ubuntu-22.04-gcc11... / linux-job (gh)
/usr/include/c++/11/bits/std_function.h:435:9: error: ‘std::function<_Res(_ArgTypes ...)>::function(_Functor&&) [with _Functor = torch::executor::native::utils::internal::dtype_specialized_elementwise_fn_impl<c10::BFloat16, c10::BFloat16, torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes> >(const torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>&, executorch::runtime::KernelRuntimeContext&, const executorch::runtime::etensor::Tensor&, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>)::<lambda(auto:26, auto:27)>; _Constraints = void; _Res = void; _ArgTypes = {long int, long int}]’, declared using local type ‘torch::executor::native::utils::internal::dtype_specialized_elementwise_fn_impl<c10::BFloat16, c10::BFloat16, torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes> >(const torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>&, executorch::runtime::KernelRuntimeContext&, const executorch::runtime::etensor::Tensor&, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>)::<lambda(auto:26, auto:27)>’, is used but never defined [-fpermissive]
pull / test-models-linux-basic (vit, xnnpack-quantization-delegation, buck2, linux.2xlarge, executorch-u... / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_addmm.cpp:96:26: error: invalid operands to binary expression ('const at::vec::Vectorized<c10::BFloat16>' and 'const CTYPE' (aka 'const c10::BFloat16'))
pull / test-models-linux-basic (vit, xnnpack-quantization-delegation, cmake, linux.2xlarge, executorch-u... / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-models-linux-basic (vit, xnnpack-quantization-delegation, cmake, linux.arm64.2xlarge, execut... / linux-job (gh)
/usr/include/c++/11/bits/std_function.h:435:9: error: ‘std::function<_Res(_ArgTypes ...)>::function(_Functor&&) [with _Functor = torch::executor::native::utils::internal::dtype_specialized_elementwise_fn_impl<c10::BFloat16, c10::BFloat16, torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes> >(const torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>&, executorch::runtime::KernelRuntimeContext&, const executorch::runtime::etensor::Tensor&, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>)::<lambda(auto:26, auto:27)>; _Constraints = void; _Res = void; _ArgTypes = {long int, long int}]’, declared using local type ‘torch::executor::native::utils::internal::dtype_specialized_elementwise_fn_impl<c10::BFloat16, c10::BFloat16, torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes> >(const torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>&, executorch::runtime::KernelRuntimeContext&, const executorch::runtime::etensor::Tensor&, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>)::<lambda(auto:26, auto:27)>’, is used but never defined [-fpermissive]
pull / test-moshi-linux / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-openvino-linux / linux-job (gh)
/usr/include/c++/9/bits/std_function.h:667:7: error: ‘std::function<_Res(_ArgTypes ...)>::function(_Functor) [with _Functor = torch::executor::native::utils::internal::dtype_specialized_elementwise_fn_impl(const Op&, executorch::runtime::KernelRuntimeContext&, const executorch::runtime::etensor::Tensor&, Args ...) [with CTYPE_COMPUTE = c10::BFloat16; CTYPE_OUT = c10::BFloat16; Op = torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>; Args = {std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>}]::<lambda(auto:26, auto:27)>; <template-parameter-2-2> = void; <template-parameter-2-3> = void; _Res = void; _ArgTypes = {long int, long int}]’, declared using local type ‘torch::executor::native::utils::internal::dtype_specialized_elementwise_fn_impl(const Op&, executorch::runtime::KernelRuntimeContext&, const executorch::runtime::etensor::Tensor&, Args ...) [with CTYPE_COMPUTE = c10::BFloat16; CTYPE_OUT = c10::BFloat16; Op = torch::executor::native::addmm_out(executorch::runtime::KernelRuntimeContext&, const Tensor&, const Tensor&, const Tensor&, const Scalar&, const Scalar&, torch::executor::native::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:48, auto:49)>; Args = {std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>, std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>}]::<lambda(auto:26, auto:27)>’, is used but never defined [-fpermissive]
pull / test-phi-3-mini-runner-linux / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_sub.cpp:120:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-pybind-build-linux / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-quantized-aot-lib-linux / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / test-selective-build-linux / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_addmm.cpp:96:26: error: invalid operands to binary expression ('const at::vec::Vectorized<c10::BFloat16>' and 'const CTYPE' (aka 'const c10::BFloat16'))
pull / test-setup-linux-gcc / linux-job (gh)
/usr/include/c++/9/bits/std_function.h:667:7: error: ‘std::function<_Res(_ArgTypes ...)>::function(_Functor) [with _Functor = torch::executor::native::utils::internal::dtype_specialized_elementwise_fn_impl(const Op&, executorch::runtime::KernelRuntimeContext&, const executorch::runtime::etensor::Tensor&, Args ...) [with CTYPE_COMPUTE = double; CTYPE_OUT = double; Op = torch::executor::native::add_scalar_out(executorch::runtime::KernelRuntimeContext&, const executorch::runtime::etensor::Tensor&, const executorch::runtime::etensor::Scalar&, const executorch::runtime::etensor::Scalar&, executorch::runtime::etensor::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:54)>; Args = {std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>}]::<lambda(auto:26, auto:27)>; <template-parameter-2-2> = void; <template-parameter-2-3> = void; _Res = void; _ArgTypes = {long int, long int}]’, declared using local type ‘torch::executor::native::utils::internal::dtype_specialized_elementwise_fn_impl(const Op&, executorch::runtime::KernelRuntimeContext&, const executorch::runtime::etensor::Tensor&, Args ...) [with CTYPE_COMPUTE = double; CTYPE_OUT = double; Op = torch::executor::native::add_scalar_out(executorch::runtime::KernelRuntimeContext&, const executorch::runtime::etensor::Tensor&, const executorch::runtime::etensor::Scalar&, const executorch::runtime::etensor::Scalar&, executorch::runtime::etensor::Tensor&)::<lambda()>::<lambda()>::<lambda(auto:54)>; Args = {std::pair<const executorch::runtime::etensor::Tensor*, torch::executor::native::utils::SupportedTensorDtypes>}]::<lambda(auto:26, auto:27)>’, is used but never defined [-fpermissive]
pull / test-static-llama-qnn-linux / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / unittest / linux / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'double')
pull / unittest / macos / macos-job (gh)
/Users/ec2-user/runner/_work/executorch/executorch/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'CTYPE_COMPUTE' (aka 'double'))
pull / unittest-arm / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_addmm.cpp:96:26: error: invalid operands to binary expression ('const at::vec::Vectorized<c10::BFloat16>' and 'const CTYPE' (aka 'const c10::BFloat16'))
pull / unittest-buck / linux / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_addmm.cpp:96:26: error: invalid operands to binary expression ('const at::vec::Vectorized<c10::BFloat16>' and 'const CTYPE' (aka 'const c10::BFloat16'))
pull / unittest-buck / macos / macos-job (gh)
/Users/ec2-user/runner/_work/executorch/executorch/pytorch/executorch/kernels/portable/cpu/op_addmm.cpp:96:26: error: invalid operands to binary expression ('const at::vec::Vectorized<c10::BFloat16>' and 'const CTYPE' (aka 'const c10::BFloat16'))
pull / unittest-editable / linux / linux-job (gh)
/pytorch/executorch/kernels/portable/cpu/op_addmm.cpp:96:26: error: invalid operands to binary expression ('const at::vec::Vectorized<c10::BFloat16>' and 'const CTYPE' (aka 'const c10::BFloat16'))
pull / unittest-editable / macos / macos-job (gh)
/Users/ec2-user/runner/_work/executorch/executorch/pytorch/executorch/kernels/portable/cpu/op_add.cpp:112:24: error: invalid operands to binary expression ('const at::vec::Vectorized<double>' and 'CTYPE_COMPUTE' (aka 'double'))

This comment was automatically generated by Dr. CI and updates every 15 minutes.

this works with op_mul, which is vectorized-friendly, but doesn't work when we roll out to pattern.h because those ops will not work with Vectorized yet. See TODO in elementwise_util.h ghstack-source-id: 30d2311bed080c3a5390ab00ca20a1e33563f077 ghstack-comment-id: 2738665976 Pull Request resolved: #9432

this works with op_mul, which is vectorized-friendly, but doesn't work when we roll out to pattern.h because those ops will not work with Vectorized yet. See TODO in elementwise_util.h ghstack-source-id: 8d76653f819dc58a0c93540f3d71a89bfdb7cd26 ghstack-comment-id: 2738665976 Pull Request resolved: #9432

swolchok · 2025-04-02T02:34:04Z

This PR is not yet complete (I need to go through and make sure no ops are needlessly held back from vectorization, and I need to overload unary - on at::vec::Vectorized so that op_sigmoid.cpp can vectorize nicely), but it builds and should vectorize a large chunk of elementwise portable ops. Hopefully this makes the direction/vision more clear @kimishpatel

(CI failures are expected; the builds are contingent on pytorch/pytorch#150380 which is not yet actually merged into PyTorch core. We'll need a pin bump.)

swolchok · 2025-04-02T02:51:23Z

kernels/portable/cpu/op_where.cpp

+        [](const CTYPE_COMPUTE val_a, const CTYPE_COMPUTE val_b, const CTYPE_COMPUTE val_c) {
          return val_c ? val_a : val_b;
        },


(ATen, and therefore our own optimized op_where, doesn't vectorize this)

kimishpatel · 2025-04-02T03:05:11Z

where do you expect we would have to say not want to vectorize something?

we should not bother generating vectorized loops at least when at::vec::Vectorized doesn't have accelerated support for the architecture we are targeting. (I need to update this PR). It is also something we may want to let the user disable to squeeze on size, though I think this is a weaker reason.

I am not fully sold but I havent looked at the PR in detail so I will further comment after that. Maybe you are right in that not everything will fit the Vectorized pattern and for those cases your pattern of checking callable for vectorized is probably a better way to do it.

this works with op_mul, which is vectorized-friendly, but doesn't work when we roll out to pattern.h because those ops will not work with Vectorized yet. See TODO in elementwise_util.h ghstack-source-id: 8d76653f819dc58a0c93540f3d71a89bfdb7cd26 ghstack-comment-id: 2738665976 Pull Request resolved: #9432

[ghstack-poisoned]

this works with op_mul, which is vectorized-friendly, but doesn't work when we roll out to pattern.h because those ops will not work with Vectorized yet. See TODO in elementwise_util.h ghstack-source-id: 033b63ce3bee8a0136efdab3e03905cafb79b915 ghstack-comment-id: 2738665976 Pull Request resolved: #9432

kimishpatel

Overall this makes sense but I left some questions around why we need can_use_vectorized based approach

kimishpatel · 2025-04-02T21:16:56Z

kernels/portable/cpu/util/vectorized_math.h

+  template <typename T>                                                  \
+  auto func_name(at::vec::Vectorized<T> vec) {                           \
+    if constexpr (!::executorch::runtime::is_floating_point<T>::value) { \
+      return at::vec::convert<float>(vec).func_name();                   \


is this a valid thing to do? that is convert say an instance of at::vec::Vectorized<int8_t> to float and apply the func_name? Maybe i am misunderstanding how this works

kimishpatel · 2025-04-02T21:17:39Z

kernels/portable/cpu/util/vectorized_math.h

+ */
+#define ET_INTERNAL_VECTORIZED_FLOAT_UNARY_FUNC(func_name)               \
+  namespace executorch {                                                 \
+  inline namespace math {                                                \


what does inline to namespace do?

kimishpatel · 2025-04-02T21:18:26Z

kernels/portable/cpu/util/vectorized_math.h

+ * corresponding operator is a "float op" in TensorIterator parlance
+ * (i.e., uses something like build_borrowing_binary_float_op()),


I dont know if anyone reading this code would understand what this means?

But this provides answer to my earlier question

kimishpatel · 2025-04-02T21:21:29Z

kernels/portable/cpu/op_where.cpp

@@ -47,7 +47,7 @@ Tensor& where_out(
        CTYPE_COMPUTE,
        op_name,
        utils::SupportedTensorDtypes::SAME_AS_COMMON>(
-        [](const auto val_a, const auto val_b, const auto val_c) {
+        [](const CTYPE_COMPUTE val_a, const CTYPE_COMPUTE val_b, const CTYPE_COMPUTE val_c) {


is this unrelated change?

no, we can't vectorize this op

kimishpatel · 2025-04-02T21:22:44Z